The AAC [Austrian Academy Corpus] - An Enterprise to Develop Large Electronic Text Corpora

Authors

  • Hanno Biber
  • Evelyn Breiteneder
Abstract

The AAC [Austrian Academy Corpus] is a corpus research institution based at the Austrian Academy of Sciences in Vienna. The AAC is a very large and complex electronic text collection. Its aims are to create an innovative text corpus and to conduct scholarly and scientific research in the field of electronic text corpora. In the first phase of the corpus build up the AAC is committed to having at least 100 million running words of carefully selected and scholarly annotated significant texts. The corpus approach of the AAC will allow a variety of investigations into the linguistic properties, the textual structures and the historical and literary significance of the selected texts. In the second phase of application development the size of the AAC will increase to around one billion running words. In this phase selected subcorpora will be annotated in greater detail following the AAC schemes for annotation and according to its editorial principles. The AAC working group is endeavouring to establish a corpus that meets the needs of textual studies and conveys essential information about the German language as well as about the history of the time in focus as a history of texts and of language.

1. The AAC [Austrian Academy Corpus]

The AAC is a corpus research institution based at the Austrian Academy of Sciences in Vienna. The AAC is a very large and complex electronic text collection. The primary aims of the AAC are to create an innovative and experimental text corpus of significant texts as well as to conduct scholarly and scientific research in the field of electronic text corpora. The texts selected for inclusion in the corpus date from the period between the 1848 Revolution and the fall of the Berlin Wall in 1989. The texts will be predominantly in German, but other specific parallel corpora and multimedia collections will also be included.
Figure 1: AAC

AAC Corpus Build Up

In the first phase of the corpus build up, which will be completed by the end of the year 2005, the AAC is committed to having at least one hundred million running words of carefully selected and scholarly annotated significant texts ready in digital format. At present, however, we already have approximately two hundred million running words to hand. The corpus approach of the AAC will allow a variety of investigations to be carried out into the linguistic properties, the textual structures and the historical and literary significance of these texts. In the second phase of application development the size of the AAC will increase ten-fold to around one billion running words by the end of the year 2010. In this phase selected subcorpora will be carefully annotated in greater detail. The annotation process will follow the AAC schemes for annotation and mark up, and it will be carried out according to the AAC’s editorial policies and principles. The AAC working group, who have had expertise in linguistics and literary studies, in computer supported lexicographic


Similar resources

Words in Contexts: Digital Editions of Literary Journals in the "AAC - Austrian Academy Corpus"

In this paper two highly innovative digital editions will be presented. For the creation and the implementation of these editions the latest developments within corpus research have been taken into account. The digital editions of the historical literary journals "Die Fackel" (published by Karl Kraus in Vienna from 1899 to 1936) and "Der Brenner" (published by Ludwig Ficker in Innsbruck from 19...


Fivehundredmillionandone Tokens. Loading the AAC Container with Text Resources for Text Studies

The "AAC Austrian Academy Corpus" is a diachronic German language digital text corpus of more than 500 million tokens. The text corpus has collected several thousands of texts representing a wide range of different text types. The primary research aim is to develop text language resources for the study of texts. For corpus linguistics and corpus based language research large text corpora need t...


Comparative Study of the Academic Vocabulary Content of Electronic Engineering Corpora, GE Materials and M.S. Entrance Examinations

The importance of vocabulary learning has been underlined in the field of English for Academic Purposes (EAP) because non-English majors who require reading English texts in their fields of study have to expand their English vocabulary knowledge much more efficiently than ordinary ESL/EFL learners. Since academic vocabulary instruction in Iranian universities is realized through the use of Gene...


Hooking up to the corpus: the Viennese Lexicographic Editor’s corpus interface

The paper addresses the issue of interfacing between digital corpora and a new dictionary writing application being developed at the ICLTT (Institute of Corpus Linguistics and Text Technology of the Austrian Academy of Sciences). It deals with issues of dictionary creation, software design, usability and interoperability in relation to the example of this fairly new piece of software, the Vienn...


Clustering bilingual text corpora using mixtures of von Mises-Fisher distributions: With an application to the corpus of abstracts of the Austrian Journal of Statistics (Master Thesis)

With the increase of availability of text corpora on the internet and specifically bilingual text corpora, the objective of this thesis is to develop an extension to the movMF model by Banerjee et al. (2005). This model was implemented using two algorithms, the EM algorithm and an annealing variant called the DAEM algorithm, which often yielded better results. With the aim of analyzing the corp...




Publication date: 2004